Brian S. Evans, Ph.D.
Migratory Bird Center
Smithsonian Conservation Biology Institute
In this lesson we will explore for loops and functions. A for loop is a block of code that allows you to repeat an operation a set number of iterations. A function is a set of code that allows you to create or modify objects in a repeatable way. As data analysis often includes many repeated tasks, for loops and user-defined functions can save considerable time, shorten R scripts, and make your scripts more legible. Understanding how they work is crucial to developing good coding practices.
For this lesson, we will use three example data frames: Iris sepal and petal measurements collected by botanist Edgar Anderson in the 1920’s, bird traits collected by Wilman et al. (2014), and bird point count observations from Washington, DC (Evans et al. 2016).
Open R Studio and run the following lines of code to load the data into R:
#=================================================================================*
# ---- set up ----
#=================================================================================*
# Load libraries
library(RCurl)
library(tidyverse)
# Note: If you have yet to install these library, please do so with:
# install.packages('rCurl') ; install.packages('tidyverse')
# Provide the web addresses of the files (note: iris is preloaded example data):
habitsURL <- getURL('https://raw.githubusercontent.com/bsevansunc/workshop_languageOfR/master/birdHabits.csv')
countsURL <- getURL('https://raw.githubusercontent.com/bsevansunc/workshop_languageOfR/master/birdCounts.csv')
# Read in the data:
birdHabits <- tbl_df(read.csv(text = habitsURL))
birdCounts <- tbl_df(read.csv(text = countsURL))
Take a moment to explore the data:
# Explore the data (birdHabits example):
head(birdHabits)
dim(birdHabits)
attributes(birdHabits)
str(birdHabits)
summary(birdHabits)
Have a look at the built-in iris dataset as well:
# Explore the iris data:
head(iris)
dim(iris)
attributes(iris)
str(iris)
summary(iris)
For the iris dataset, I think it’s best to do some familiar cleaning steps:
# Clean up iris for analysis:
irisTbl <- tbl_df(iris)
names(irisTbl) <-
c('sepalLength',
'sepalWidth',
'petalLength',
'petalWidth',
'species')
Why would you use for loops? Let’s look at a common example. You have been asked to calculate the mean petal length of three Iris species: Iris setosa, Iris versicolor, and Iris virginica. (coded as the factor levels setosa, versicolor, and virginica in the species field of the irisTbl data frame). Using the mean function and tools that we have addressed thus far, use the following to calculate the mean petal length for each species:
# Filter irisTbl to setosa:
irisTbl[irisTbl$species == 'setosa', ]
# Extract the petalLength field (column):
irisTbl[irisTbl$species == 'setosa', ]$petalLength
# Calculate the mean of petal lengths:
mean(irisTbl[irisTbl$species == 'setosa', ]$petalLength)
Exercise One:
Calculate the mean petal length of each of the Iris species
The code you generated above is very repetitive – the only change in each of the lines you should have created above was the name of the species. Writing code like this makes your scripts unnecessarily long and prone to user input error. For loops should be used whenever a chunk of code is repeated more than twice.
We’ll return to the Iris example a bit later, but for now let’s start with a very simple for loop.
Writing proper for loops requires following these three steps:
Let’s use some example data. We’ll construct a vector, v using a set of five numbers.
v <- c(1,1,2,3,5)
v
We would like to modify the values in vector v by adding one to each value. This might be written mathematically as:
Recall that value v[i] is equal to the value at position i in vector v. Let’s take a look at the value of v at position 3:
v[3]
i <- 3
v[i]
v[3] == v[i]
If we want to calculate v + 1 at position 3, we would write:
v[3] + 1
For loop development begins by defining an object of a given length that your for loop will write to. This step is very important – your for loop will run much, much slower if you do not do so!
Vector objects are defined as follows:
vNew <- vector('numeric', length = length(v))
str(vNew)
The first argument of the vector function is the type of object you would like to create. The next argument sets the length of the created object to be the equivalent to that of v.
Values of vector vNew will be replaced location-by-location. For example, let’s compare the initial value of v with the resultant value of v + 1 at position 3:
i <- 3
v[i]
vNew[i] <- v[i] + 1
vNew[i]
v[i] + 1 == vNew[i]
The utility of the for loop is that we can calculate the above for each position (i) in vector v. This is done by setting the “for loop sequence” statement which defines the locations over which the for loop will be calculated. The for loop sequence for locations one through five is written as (DO NOT RUN):
# for(i in 1:5)
The above can be translated as “for position i in positions one through five”.
To make our code as flexible as possible, we generally do not want to have to directly type in the stopping point of the for loop. We can use 1:length(v) or the function seq_along to specify the for loop address. Run the following and note the behavior:
v
1:5
1:length(v)
seq_along(v)
The for loop sequence statement can then be written as (DO NOT RUN):
# Example for loop sequence statements:
# for(i in 1:length(v))
# for(i in seq_along(v))
The above statements are more flexible than providing a numeric stopping point for your for loop.
The for loop sequence statement is followed by the “body” statement that tells R what to do during each iteration of the loop. The body associated with our “add one” formula is (DO NOT RUN):
vNew[i] <- v[i] + 1
If your for loop spans multiple lines, place the body within curly brackets, for example (DO NOT RUN):
# for(i in 1:length(x)){
# body
# }
Putting together our steps of: 1) Creating an output object, 2) The for loop sequence statement, and 3) The body statement, our completed for loop is written as follows (RUN THIS ONE):
vNew <- numeric(length = length(v))
for(i in seq_along(v)){
vNew[i] <- v[i] + 1
}
Take a look at the output and compare it with the values of v:
vNew
v
vNew == v
We specify the length of the vector to provide R with stopping rules – without this for loops can become very memory hungry when running over large datasets
The following two sections of code are equivalent, but the latter much easier to read. As writing code is both a tool and a method of communication, you should ensure that your code is as readable as possible.
for(i in 1:length(v)) vNew[i] <- v[i] + 1
for(i in 1:length(v)){
vNew[i] <- v[i] + 1
}
When writing for loops, it is necessary to ensure that the loop is doing what you expect it to do. A simple way to ensure that this is the case is to specify i and run the instructions on just that value. For example, to observe the behavior of the for loop at position 3:
i = 3
vNew[i] <- v[i] + 1
vNew[i]
v[i]
Exercise Two:
- Use a for loop to test whether each species of bird in the
birdHabitsdata frame is an omnivore (logical test).- Calculate the total number of records associated with omnivorous birds.
You may have noticed that none of the operations we completed in the previous section actually required for loops (for example, v + 1 is calculated for each value of v by default). The Iris example at the start of this lesson, however, highlights their utility for summarizing data. Calculating the mean for each species without a for loop required the following code:
# Mean petal lengths of Iris species without a for loop:
mean(irisTbl[irisTbl$species == 'setosa', ]$petalLength)
mean(irisTbl[irisTbl$species == 'versicolor', ]$petalLength)
mean(irisTbl[irisTbl$species == 'virginica', ]$petalLength)
We can use a for loop to calculate this value across species. To do so, we first need to create a vector of species:
# Make a vector of species to loop across:
irisSpecies <- levels(irisTbl$species)
irisSpecies
Next, we’ll create an empty vector to store our output:
# For loop output statement:
petalLengths <- vector('numeric',length = length(irisSpecies))
petalLengths
The vector of Iris species will define the bounds of our for loop sequence (DO NOT RUN):
# For loop sequence:
# for(i in seq_along(irisSpecies))
Our for loop body is constructed as:
# Exploring the construction of the for loop body:
i <- 3
irisSpecies[i]
irisTbl[irisTbl$species == irisSpecies[i], ]
irisTbl[irisTbl$species == irisSpecies[i], ]$petalLength
mean(irisTbl[irisTbl$species == irisSpecies[i], ]$petalLength)
# The for loop body:
petalLengths[i] <- mean(irisTbl[irisTbl$species == irisSpecies[i], ]$petalLength)
Putting it all together, the for loop would be written as:
# Make a vector of species to loop across:
irisSpecies <- levels(irisTbl$species)
# For loop output statement:
petalLengths <- vector('numeric',length = length(irisSpecies))
# For loop:
for(i in seq_along(irisSpecies)){
petalLengths[i] <- mean(irisTbl[irisTbl$species == irisSpecies[i], ]$petalLength)
}
We can turn our result into a tidy tibble data frame, using the data_frame function:
petalLengthFrame <- data_frame(species = irisSpecies, count = petalLengths)
petalLengthFrame
Exercise Three:
Use a for loop and the
birdHabitsdata frame to calculate the number species in each diet guild.
We may also be interested in using a for loop to generate a vector of numbers based on some mathematical operation. For example, consider we have a value, 10, and want to calculate the mathematical expression:
\[n_t = 2(n_{t-1})\]
We must first start with a “seed” value for our vector. This is the initial value of vector n.
n <- 10
In this instance, because we will store our resultant values within the same vector, n, it is unnecessary to create a blank vector to store the results.
Let’s calculate the first five values in this series. Because our for loop starts with our seed value, we are only interested in the second through fifth positions of this vector. Thus, our for loop initialization statement is (DO NOT RUN):
for(i in 2:5)
And our complete for loop statement becomes:
n = 10
for(i in 2:5){
n[i] = n*v[i-1]
}
Exercise Two:
One of my favorite for loops was created by Leonardo Bonacci (Fibonacci). He created the first known population model, from which the famous Fibonacci number series was created. He described a population (N) at time t as the sum of the population at the previous time step plus the time step before that:
\[N_t = N_{t-1} + N_{t-2}\]
- Use the combine function to create a seed vector of two values, zero and one.
- Use the formula above and your seed vector to generate the first 20 numbers of the Fibonacci number sequence. Hint: The for loop initialization will begin at third position – it will NOT include all of 1:20!
In the introductory lesson and above, we have worked with several functions (e.g., “c”, “mean”). Functions allow you to simplify complex tasks, which is especially useful if you need to run a task multiple times. Program R contains many functions and many more still are created by R users and provided to the R community as collections of functions known as packages (or libraries). As datasets and data handling needs are often distinct, relying exclusively on built-in and community-defined functions may be limiting. Learning how to create your own functions, or customize existing functions, provides you with the flexibility to solve unique problems, shorten your script, and make your R code more legible for others.
Writing your own functions is easy, as long as you follow the correct syntax. The basic structure is:
theNameOfMyFunction <- function(objectToComputeFunctionFor) {
What you would like to happen when you run your function
}
Let’s write a function that will do the same thing as our first for loop – add 1 to every value in our vector V0.
addOneFun <- function(x){
x+1
}
Let’s run the addOneFun function and compare it to our values in V1 above:
addOneFun(V0)
V1
Exercise Four:
The mathematical formula for standard error is provided below. Convert this to an R function (Note: the function for standard deviation is “sd” and the function for square root is “sqrt”): \[StdErr (x) = \frac{StDev(x)}{\sqrt{n}}\]
In the for loops section above, we subset the birdHabits frame by the aerial foraging guild using:
birdHabits[birdHabits$foraging == 'aerial',]
If it is necessary to repeat this operation multiple times, we can simplify our scripts and make them more readable by wrapping our subsetting operation within a function. Below, we create a function that subsets by foraging guild and extract the data using the character value for that guild:
foragingSubsetFun <- function(foragingGuild){
birdHabitsSubset <- birdHabits[birdHabits$foraging == foragingGuild,]
return(birdHabitsSubset)
}
foragingSubsetFun('aerial')
Likewise, we may be interested in calculating the sum of counts for a given guild. Above we calculated the summed counts for a given guild as:
sum(birdHabits[birdHabits$foraging == 'aerial','count'])
We can embed this line of code within a function to extract the sum of counts for a given foraging guild:
foragingSumFun <- function(foragingGuild){
birdHabitsubset <- birdHabits[birdHabits$foraging == foragingGuild,'count']
return(sum(birdHabitsubset))
}
foragingSumFun('aerial')
We could also sum the count field by nesting the previous subsetting function within a new function:
foragingSumFun <- function(foragingGuild){
birdHabitsubset <- foragingSubsetFun(foragingGuild)
return(sum(birdHabitsubset$count))
}
foragingSumFun('aerial')
We can make our subsetting and summing functions more flexible by allowing us to choose which trait (e.g., diet or foraging) and guild we would like to count:
traitGuildSubsetFun <- function(trait, guild){
birdHabitsubset <- birdHabits[birdHabits[,trait] == guild,]
return(birdHabitsubset)
}
traitGuildSumFun <- function(trait, guild){
birdHabitsubset <- traitGuildSubsetFun(trait, guild)
return(sum(birdHabitsubset$count))
}
traitGuildSumFun('foraging','aerial')
traitGuildSumFun('diet','insectivore')
Exercise Five:
Create a function that calculates the mean count for a given diet guild across sites. Hint: Mean count is the sum of counts divided by the total number of sites in the birdHabits data frame. Do not use the built-in “mean”" function. Why can’t we use the mean function?
Create a function that calculates the mean count for a selected trait (e.g. diet or foraging) and guild across sites.
You can use for loops and functions to examine fundamental theoretical models in ecology. (Only) If you’re feeling super comfortable with for loops and functions, try out these additional exercises: